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Abstract 

Exact inference in the linear regression model with spike and slab priors is often 
intractable. Expectation propagation (EP) can be used for approximate inference. 
However, the regular sequential form of EP (R-EP) may fail to converge in this 
model when the size of the training set is very small. As an alternative, we propose 
a provably convergent EP algorithm (PC-EP). PC-EP is proved to minimize an 
energy function which, under some constraints, is bounded from below and whose 
stationary points coincide with the solution of R-EP. Experiments with synthetic 
data indicate that when R-EP does not converge, the approximation generated by 
PC-EP is often better By contrast, when R-EP converges, both methods perform 
similarly. 



1 Introduction 



Exact Bayesian inference is often intractable in many probabilistic models of practical interest. The 
computational cost of marginalization operations scales exponentially in the number of variables 
or these operations require to compute high-dimensional integrals that do not have a closed-form 
analytical solution. In practice, we have to use some form of approximation. Approximate inference 
in large applications is frequently implemented using deterministic methods. These have often less 
computational cost than other alternatives such as sampling. Deterministic methods approximate 
the exact posterior by a tractable parametric distribution whose parameters are selected by solving 
optimization problems. Expectation propagation (EP) is one of the most successful techniques for 
deterministic approximate inference [l]. However, a disadvantage of EP is that, in its standard 
sequential form, it is not guaranteed to converge. 

In this work we focus on the linear regression model with spike-and-slab priors (LRMSSP). The 
standard sequential EP method often generates in this model very accurate approximations of the 
posterior distribution |2|. However, EP may fail to converge in some extreme cases in which the 
number of training instances is very small. To avoid this, we introduce a provably convergent EP 
algorithm (PC-EP) for approximate inference in the LRMSSP. PC-EP is based on the double loop 
algorithm described in |3 1. Each outer iteration of PC-EP is proved to minimize an energy function 
whose stationary points coincide with the solution of regular EP |4|. The main difference between 
PC-EP and the algorithm proposed in |3| is that PC-EP constrains the variance parameters to be 
positive. This ensures that the energy function minimized by PC-EP is bounded from below. The 
boundedness of this function is a necessary condition to guarantee convergence. Experiments with 
synthetic data illustrate the advantages of PC-EP over regular EP. In these experiments, when regular 
EP does not converge, the posterior approximation generated by PC-EP is often better. By contrast, 
when regular EP converges, both methods perform similarly. 
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2 The Linear Regression Model with Spike-and-slab Priors 



We focus on the problem of Bayesian inference in the linear regression model with spike-and-slab 
priors (LRMSSP). Consider n feature vectors, with d real components each, encoded by the design 
matrix X = (xi, . . . , x„)^ and associated target values y = {yi, . . . , y„)^ e R". The LRMSSP 
assumes that y = Xw + e, where w is an unknown d-dimensional vector of real coefficients and 
e is an n-dimensional vector of independent Gaussian noise with variance a^. Given X and y, the 
likelihood of w is a Gaussian function: P{y\w, X) = A/'(y |Xw, ct^I). The prior for w is a product 
of spike-and-slab factors [5J: 

d 

^(W) = n [PsJ^{w^\0. Vs) + (1 - Ps)S{w,)] , (1) 
i=l 

where Af{-\0, Vg) denotes a Gaussian density with zero mean and variance Vg (the slab), S{-) is a 
delta function centered at zero (the spike), and ps is the prior probability that any of the components 
of w is different from zero. The LRMSSP has special practical interest in regression problems with 
more features than training instances (that is, d » n). In this scenario, the assumption of sparsity 
is used to reduce over-fitting by limiting the number of influential features |6|. The spike-and-slab 
prior ([T]i incorporates this assumption by inducing a bi-separation in the model coefficients. A small 
number of coefficients are considered to be different from zero. These are the coefficients sampled 
from the slab. At the same time, a reduced number of coefficients are considered to be exactly zero. 
These are the coefficients sampled from the spike. 

Given X and y, the posterior distribution for w can be computed using Bayes' rule: 

^(y|x) 

The central operation in the application of Bayesian methods is the computation of marginalizations 
or expectations with respect to this distribution. However, these operations are usually intractable. 
As a solution, we use EP |4| to produce a deterministic approximation to ^ which is tractable. The 
numerator in the right part of (|2]i can be written as 

d 

V{y\^, X)V{^) = A/'(y|Xw, a^I) , (3) 

1=1 

where fi{wi) = pAf{wi\0, Vg) + (1 — p)S{wi). EP replaces each fi by a Gaussian factor fi{wi) = 
exp {viiWi — '^V2iw1 + Zi], where vu, V2i and Zi are free parameters. Let Q(w) be the product 

of the likelihood A/^(y|Xw, cr^I) and the Gaussian factors fi, . . . , fd- Then Q is an unnormaUzed 
multivariate Gaussian distribution: 

d 

Q(w) = AA(y|Xw, a^I) J] /,(m,) cx A/-(w|m, A-^) , (4) 

4=1 

where A = o-^^X^X + diag(v2), m — A^^(vi + cr^^X^y), vi = (wn, . . . , vidY and V2 = 
{v2i, . . . , W2d)^. Note that the posterior (|2]i is obtained when (O is normalized so that it integrates 
one. Similarly, the EP approximation to the posterior is given by the normalized version of Q. 

The sequential implementation of EP iteratively updates the fi until reaching a stationary point of a 
specific energy function |4 |. However, the EP update operations do not guarantee the minimization 
of this energy function and the method may sometimes fail to converge. Non-convergence can 
be prevented by damping the EP update operations |7|. After performing one update operation, 
damping sets the updated factor fi to a log-convex combination of the factor before (old) and after 
(new) the update, that is, fi = {fi^Yiff'^)^^'^, with t G [0, 1]. However, even when damping is 
used, EP can fail to converge in some extreme cases in which the number of training instances is 
very small. When this happens, we can limit the maximum number of iterations of the algorithm and 
hope that the resulting approximation is accurate enough. As an alternative, we present a provably 
convergent EP algorithm (PC-EP). 
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3 Provably Convergent EP for the LRMSSP 



The EP solutions in the LRMSSP are in a one-to-one correspondence with stationary points of the 
following objective |i8i|9J: 

min max E{v, v, v) , (5) 

where v = {vi, V2}, v = {vi, V2}, v = {vi, V2}, the elements in these tuples are d-dimensional 
real vectors satisfying 

Vi = Vi + Vi , V2 = V2 + V2 , (6) 

E{v, V, v) is an energy function given by 

E{v, v,v)^ - log Z{d) - log Z{v) + log Z{v) (7) 
and Z{v), Z{v) and Z{v) are the following normalization constants: 

Z(i) = J 7V^(y|Xw, cr^I) f[ exp jwHrn; - ^W2iwf | dw , (8) 
Ziv) = 11/ I ^i^'^' ~ I fi{wt)dwi , (9) 

Z{v) j "^XP l^li^^i - ^"2i«^i^| dWi , (10) 

where Vi = {vu, vmY, V2 = (u2i, . • . , hd)'^, vi = (vu, vm)'^, ^1 = (w2i, ■ • ■ , V2dY 
and the vectors vi and V2 contain the natural parameters of the marginals of Q. Once a stationary 
point of ^ is found, the optimal value of zi , . . . , can be computed very easily as a function of 
the solution for Vi and V2. See ifTOl for further details. To find a stationary point of the objective 
energy, we follow f3l and attempt to find a minimum of this function. However, further constraints 
have to be imposed on V2, V2 and V2 to guarantee that the energy E[y, v, v) is lower bounded. 

3.1 Lower Bound on the Objective Energy 

Consider, in addition to constraints the following inequality constraints on V2, V2 and V2: 

-V2 h s , V2 ^ e , V2 ^ 3£ , (11) 

where £ is a d-dimensional vector whose components are all equal to a small positive constant e. 
Then, the objective (|5j is bounded from below. For any vi and V2 such that V2 >z 3£, we can choose 
Vi = vi = 0.5vi and V2 = V2 = 0.5v2. These parameter values satisfy the required equality and 
inequality constraints and give a lower objective than the maximizers of E{v, v, v) when v is held 
fixed. This is more compactly written as v — 0.5v and v = 0.5v, were 0.5w = {0.5vi, 0.5v2}. For 
these values of v and v we can lower bound each term of the EP energy: 

7 / 2 1 \ 

-logZ(0.5«)>^log(27ra2)--log(47r)-^f^--logT;2,j , (12) 
log Z{v) > ^ log(2^) + ^ A _ 1 log , (13) 

i—1 ^ ^ 

-logZ(0.5«)>--log(47r)-^f|l^--logt;2J -J^logif., (14) 

with — log > 1/2 log(47r) — 1/2 logV2i- Since V2 >r 3e, all these bounds are real and positive and 
their sum is n/2 log(27rf7^) — d/2 log(2), which is a lower bound on the energy function minimized 
by the convergent EP algorithm. Note that if the components of V2, V2 and V2 could be negative, 
some of the normalization constants (O, (|9]l and (fTOl i might be infinite and the EP energy would be 
unbounded. The inequality constraints in (fTTT i guarantee that this cannot occur 
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3.2 Provably Convergent Algorithm 



The first step of the provably convergent EP algorithm (PC-EP) maximizes E{v, v, v) with respect 
to V and v given the current value of v, which is denoted by w*^*^ — {vj*^ , v^'' }, that is, 

max E{v'-*\v,i)) . (15) 

V, V 

s.t dsii, (dUi 

This step is implemented using standard optimization methods such as Quasi-Newton techniques, 
optimizing only on either u or w since ^ allows us to identify one of these tuples given the other 
and the current value of v. The Lagrangian for this optimization problem is 

A, m) = E{v^'^ ,v,d) + - VI - VI ) + Xlivi'^ - V2 - V2) 

+ /2{(e-V2) + /x|(e-V2), (16) 

where A = {Ai, A2}, /i = {/ii, /i.2} and Ai, A2, and fi2 are d-dimensional vectors of Lagrange 
multipliers that satisfy /xi ^ and fi2 d: 0. The corresponding dual function is 

g{v^*\X,n)^ max C{v^*\v,v,X,^i) (17) 

V, V 

The dual problem minimizes this latter function with respect to the Lagrange multipliers: 

5(«(*))= min g{v'^'\X,fi) (18) 

A, /i 

s.t. /^l ^ 0, /i.2 ^ 

Let \*, Aj, and be the components of A and /i that minimize the objective in ( fTSl l under 
constraints /xi ^ and fi2 d: 0. The second step of the convergent algorithm solves the convex 
problem 

min A^'^vi + A^'^V2 + log Z{v) . (19) 

V 

s.t. V2 >r 3e 

Let = {vi*^^), V2*^^'} be the solution to this optimization problem. When the constraint 

on V2 is not tight, we have that Vj*^^'' = (SAj — A^ o A|)^^ and v^*^^' — —A* o V2*^^'', where 
the operator "o" denotes the Hadamard element-wise product and the inverse of a vector is defined 
as a new vector whose components are the inverse of the components of the original vector When 
the constraint is tight, the solution is still given by these formulas. However, the components of V2 
which result to be smaller than 3e (those for which the constraint is active) must now be equal to 3e. 

We now prove that the two steps of PC-EP, that is, ( fTSl l and ( fT9] l, always generate a reduction in the 
value of ,v,v). The proof shown here is similar to the one given in |3 | for the unconstrained 

case. The objective in (fTsT i is concave and the inequality constraints are affine. Thus, the weak form 
of Slater's condition is satisfied and strong duality holds. This means that the gap between the dual 
and the primal problems is zero, that is, (/(w*^*^) = See IJU for further references. Let 

A* — Aj} and /i* = /i2}- Then, we have 

max C{v'-^\v,v,X\fj.*) > max C{v'-^+^\v,v,X\n*) , (20) 
v,v v,v 

where the first equality follows from strong duality and the following inequality is obtained because 
in the second step of PC-EP we are always minimizing and consequently, C{v'^*'\ v, v, A*, /i*) will 
never increase when we replace u^*) by Continuing with the derivations, we obtain 

m2& C{v'^'+^\v,v,X\^l*)> max C{v'^'+^\v,v, X* , n*) > E{v^^'+^'>) , (21) 
v,v v,v 

s.t ©, O 

where the first inequality is obtained because adding extra constraints in a maximization problem can 
never produce higher values of the target function. The second inequality is obtained by expanding 
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£(z;(*+^), w, V, A*, fi*) according to ( fTSI l, using the fact that fJ,J{e — V2) and /x^(e — V2) can only 
be positive and finally, using the definition ( fTsT l. 

Note that we still need the vectors A{ and for the practical implementation of the second step of 
PC-EP. Let V* = {v*, V2} and v* — {v*, Vj} be the solution to ( fTSl ). Then, the gradient of C with 
respect to the elements in v and v should be zero at v — v*, v — v* , X = X* and /j, = /j,*. This 
generates the equations 

Al, = -EqK]--EpK], A*, + K. -^EpK^], A*,+/i2. = ^EqK^], (22) 

for i = 1, . . . , d, where Eg denotes expectation with respect to the normalized version of Q and Ep 
denotes expectation with respect to P{wi) oc exp{viiWi — ■^V2i'wf}fi{wi). The inequalities ^2 h £ 
and V2 h £ can only be active on either 1)2 j or Wjj, but not on both at the same time. The reason for 
this is that V2 and V2 must satisfy V2 + V2 = V2 and V2 satisfies V2 >z 3e. This means that, for each 
i, only either /i^^ or /ij^ can be different from zero, but not both at the same time. Therefore, when 
= s, we have that = and Ajj = l/2E-p[?i;f]. On the other hand, when -Ojj — e, we have 
that fi2i = and Ajj — l/2Eg[wf]. Finally, when no inequality constraints are active in and V2, 
that is, nl ^0 and /x^ = 0, we have that X^i = l/2Eg[wf] l/2Ep[wf] for i = 1, . . . , d. The 
expectations with respect to Q are given by 

EgH-m*, Eq[w^] = {A*)-^\ (23) 

for i = 1, . . . ,d, where A* = cj-^X^X + diag(v5) and m* = (A*)-i(v| + cj-^X^y). For 
computing the expectations with respect to V, we define 

= log(«*,) - I log{{v*r' + + K - ((*2.)-' + vs)-'] , (24) 

V*- 

ai = a{pi + Ps) ^ , l\ hc7{-pi- ps)v*^, (25) 

1 + V^iVs 

h = o{p, + Ps f*'^'f'll\[; + <r{-p. - Ps) Wu? - vU • (26) 

where ps = log(ps) ^ log(l —Ps) and a denotes the logistic function. The expectations with respect 
to V are then given by 

EpK] = (v*2^r\v\, - a,) , EpK^] = - ({;2^)-2(a2 - h,) + EpK]^ , (27) 

for i = 1, . . . , d. See lIU for the derivation of these formulas. 

The cost of PC-EP is determined by the computation of the diagonal of the covariance matrix of the 
posterior approximation, that is, matrix A^^ in (|4|. When the number d of features is larger than 
the number n of training instances (that is, d ^ n), we can use the Woodbury formula to compute 
A~^ in 0{dn'^ ) operations. This is the same computational cost of regular EP |2 | since this method 
also has to compute the diagonal of this matrix on each iteration. However, PC-EP computes the 
marginal variances of the posterior approximation each time that the gradient of the objective in ( fTSb 
needs to be evaluated. This means that the multiplicative constant hidden in the cost of PC-EP is 
about two orders of magnitude larger than the one hidden in the cost of regular EP. Nevertheless, by 
applying the optimization algorithm described in |,9J to PC-EP, this method can actually become as 
fast as regular EP. This is left as future work. 



4 Experiments 

The performance of regular EP (R-EP) with damping to improve convergence is compared with the 
proposed provably convergent algorithm (PC-EP). For this, we consider a synthetic linear regression 
problem with d = 25 dimensions and w sampled according to ([T). Only 5 components of w are 
different from zero on average (ps — 0.2). These are drawn from a standard Gaussian distribution 
{Vs — 1). The attribute vectors are uniformly sampled from the unit hyper-sphere. The targets are 
sampled according to yi = w'^Xi + e^, where ^ A/'(0, a'^) and a — 0.005. Using these settings, 
we generate 100 training and test sets with 10 and 1000 instances, respectively. 

R-EP and PC-EP are run on each training set and their prediction performance is evaluated on the 
corresponding test set. The method R-EP is run for a maximum of 1000 iterations using the same 
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Figure 1 : Left, average mean square error (MSB) of R-EP and PC-EP on the test sets for which 
R-EP does not converge on the associated training sets. Right, average MSE of R-EP and PC-EP on 
the test sets for which R-EP does converge. 
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constraints (fTTT l enforced in PC-EP. Different values are considered for the damping parameter: 
T 6 {0.1,0.3,0.5,0.7,0.9}. The table in the left-hand side of Figure [T] shows, for each value 
of T, the average mean square error (MSE) of each method on the test sets for which R-EP does 
not converge on the associated training sets. The number of these test sets is also displayed in the 
table. The other table in the right-hand side of the figure shows the same information for those sets in 
which R-EP does converge. These results indicate that, when R-EP does not converge, the prediction 
errors of this method are typically larger than when it does converge. However, more importantly, 
when R-EP does not converge, the posterior approximation generated by PC-EP is better than the 
one obtained by R-EP. On the other hand, when R-EP does converge, the prediction errors of both 
methods are significantly lower and both techniques obtain similar performances. 

5 Conclusions 

The LRMSSP is interesting in under-determined scenarios with n <C (i since the spike-and-slab prior 
can help to alleviate over-fitting in these cases. Nevertheless, approximate inference in this model 
can be difficult. More precisely, expectation propagation (EP) may have convergence problems in 
some extreme cases and one is forced to limit the maximum number of iterations of the algorithm. 
In this work, we have proposed a provably convergent EP algorithm (PC-EP) for the LRMSSP. 
Each iteration of PC-EP is proved to minimize an energy function whose stationary points coincide 
with the solution of regular EP (R-EP). We also introduce constraints on the parameters of the EP 
approximation so that this energy function is bounded, which guarantees convergence. Experiments 
with synthetic data illustrate the advantages of PC-EP over R-EP. When R-EP does not converge, the 
posterior approximation given by PC-EP is usually better. By contrast, when R-EP does converge, 
both methods perform similarly. 
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